Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

Lu, Yuxuan, Huang, Jing, Han, Yan, Yao, Bingsheng, Bei, Sisong, Gesi, Jiri, Xie, Yaochen, Wang, Zheshen, He, Qi, Wang, Dakuo

arXiv.org Artificial Intelligence

Recent research shows that LLM Agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
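The two headline metrics above, step-level action generation accuracy and F1 on final purchase prediction, can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code; the exact-match comparison of action strings is an assumption about how step-level accuracy is scored.

```python
from typing import List

def action_accuracy(predicted: List[str], reference: List[str]) -> float:
    """Fraction of steps where the generated action exactly matches the logged human action."""
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

def purchase_f1(pred_labels: List[bool], true_labels: List[bool]) -> float:
    """F1 score for the binary final-purchase prediction over a set of sessions."""
    tp = sum(p and t for p, t in zip(pred_labels, true_labels))
    fp = sum(p and not t for p, t in zip(pred_labels, true_labels))
    fn = sum(t and not p for p, t in zip(pred_labels, true_labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, the reported 11.86% means that, on average, fewer than one in eight generated actions matched the real user's next action.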




SCORE: A Semantic Evaluation Framework for Generative Document Parsing

Li, Renyu, Yepes, Antonio Jimeno, You, Yao, Pluciński, Kamil, Operlejn, Maximilian, Wolfe, Crag

arXiv.org Artificial Intelligence

Traditional document parsing architectures employ deterministic pipelines that sequentially combine optical character recognition (OCR), layout analysis, and rule-based table extraction to produce structured outputs. The evaluation of these systems has relied on well-established task-specific metrics including Character Error Rate (CER) and Word Error Rate (WER) [14, 20], Intersection-over-Union (IoU) [4, 16], and Tree Edit Distance-based Similarity (TEDS) [31]. These metrics operate under the assumption of unique ground truth representations, rewarding exact matches while systematically penalizing any structural deviations. The emergence of multi-modal generative document parsing systems has fundamentally transformed this landscape. Vision Language Models (VLMs) such as GPT-5 Mini, Gemini 2.5 Flash, and Claude Sonnet 3.7/4 [22, 6, 1, 2] generate holistic document interpretations that integrate visual, textual, and structural signals in an end-to-end manner. Unlike their deterministic predecessors, these systems frequently produce outputs that are semantically correct yet structurally divergent. Consider a table containing merged cells: one system may represent it as a flattened token sequence preserving reading order, while another generates hierarchical HTML markup with explicit structural relationships. Both interpretations faithfully capture the semantic content, yet traditional evaluation frameworks treat them as fundamentally incompatible, systematically misclassifying valid alternative interpretations as parsing errors. This evaluation-paradigm mismatch has significant practical implications.
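The merged-cell example above can be made concrete with a toy comparison: two outputs that carry the same cell content fail an exact-match test but agree once normalized to their content. This is an illustrative sketch, not the SCORE framework; the regex-based cell extraction and whitespace tokenization are simplifying assumptions.

```python
import re

def cells_from_html(html: str) -> list:
    """Extract cell texts from simple HTML table markup (toy normalization step)."""
    return re.findall(r"<t[dh]>(.*?)</t[dh]>", html)

# Two structurally divergent parses of the same table.
flat = "Name Age Alice 30"  # flattened token sequence preserving reading order
html = ("<table><tr><th>Name</th><th>Age</th></tr>"
        "<tr><td>Alice</td><td>30</td></tr></table>")

exact_match = (flat == html)                              # exact-match view: a parsing error
semantic_match = (flat.split() == cells_from_html(html))  # content-level view: equivalent
```

An exact-match metric scores the pair as a failure even though no information was lost, which is precisely the misclassification the passage describes.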


P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

Sun, Tao, Pan, Enhao, Yang, Zhengkai, Sui, Kaixin, Shi, Jiajun, Cheng, Xianfu, Li, Tongliang, Huang, Wenhao, Zhang, Ge, Yang, Jian, Li, Zhoujun

arXiv.org Artificial Intelligence

Academic posters are vital for scholarly communication, yet their manual creation is time-consuming, and automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents (for visual element processing, content generation, and final poster assembly), each integrated with dedicated checker modules to enable iterative refinement and ensure output quality. To foster advancements and rigorous evaluation in this domain, we construct and release P2PInstruct, the first large-scale instruction dataset comprising over 30,000 high-quality examples tailored for the academic paper-to-poster generation task. Furthermore, we establish P2PEval, a comprehensive benchmark featuring 121 paper-poster pairs and a dual evaluation methodology (Universal and Fine-Grained) that leverages LLM-as-a-Judge and detailed, human-annotated checklists. Our contributions aim to streamline research dissemination and provide the community with robust tools for developing and evaluating next-generation poster generation systems.


Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

Xu, Kai, Mao, YiWei, Guan, XinYi, Feng, ZiLong

arXiv.org Artificial Intelligence

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, to generating complete projects through natural language. Early LLM code benchmarks primarily focused on code generation accuracy, but these benchmarks have gradually become saturated, which weakens their guiding role for LLMs: for example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among various attempts to address benchmark saturation, approaches based on software engineering have stood out, but the saturation of existing software engineering benchmarks is also rapidly increasing. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. Given the scale and complexity of these projects, which were designed by engineers with 5 to 10 years of experience, each presents a significant challenge; on average, a single project takes 4 to 8 hours for a senior engineer to complete. On our given benchmark agent (Web-Agent), SOTA (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower than SWE-Bench Verified (65.4%) and Full (33.8%), indicating a far more challenging benchmark. Finally, we discuss that in any development field, Standards and Frameworks represent foundational knowledge and efficiency tools, respectively, and that LLMs require optimization tailored to both.
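The sequential-dependency design above has a direct consequence for scoring: once a task fails, the tasks that build on it cannot meaningfully pass. A minimal sketch of per-project scoring under that assumption follows; the stop-at-first-failure rule is an interpretation of "tasks with sequential dependencies", not Web-Bench's published scoring code.

```python
from typing import List

def project_pass_rate(task_passed: List[bool]) -> float:
    """Count tasks as passed only up to the first failure, since each task
    depends on the previous one; return the fraction of tasks passed."""
    if not task_passed:
        return 0.0
    passed = 0
    for ok in task_passed:
        if not ok:
            break  # downstream tasks depend on this one, so they cannot pass
        passed += 1
    return passed / len(task_passed)
```

Under this rule a single early failure caps the whole project's score, which helps explain why a 20-task chained project is much harder to saturate than independent single-function problems.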


MageBench: Bridging Large Multimodal Models to Agents

Zhang, Miaosen, Dai, Qi, Yang, Yifan, Bao, Jianmin, Chen, Dongdong, Qiu, Kai, Luo, Chong, Geng, Xin, Guo, Baining

arXiv.org Artificial Intelligence

LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities on the language side, where the chain-of-thought is composed entirely of text. We consider the scenario where visual signals are continuously updated and required along the decision-making process. Such a vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, while being rarely evaluated. In this paper, we introduce MageBench, a reasoning-capability-oriented multimodal agent benchmark that, while having lightweight environments, poses significant reasoning challenges and holds substantial practical value. This benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly validates the agent's knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models are better than random acting, and all of them are far inferior to human-level performance. More specifically, we found that current models severely lack the ability to modify their planning based on visual feedback, as well as visual imagination, interleaved image-text long-context handling, and other abilities. We hope that our work will provide optimization directions for LMMs from the perspective of being an agent. We release our code and data at https://github.com/microsoft/MageBench.


DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation

Peng, Yi-Hao, Huq, Faria, Jiang, Yue, Wu, Jason, Li, Amanda Xin Yue, Bigham, Jeffrey, Pavel, Amy

arXiv.org Artificial Intelligence

Enabling machines to understand structured visuals like slides and user interfaces is essential for making them accessible to people with disabilities. However, achieving such understanding computationally has required manual data collection and annotation, which is time-consuming and labor-intensive. To overcome this challenge, we present a method to generate synthetic, structured visuals with target labels using code generation. Our method allows people to create datasets with built-in labels and train models with a small number of human-annotated examples. We demonstrate performance improvements in three tasks for understanding slides and UIs: recognizing visual elements, describing visual content, and classifying visual content types.